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Abstract 

We propose an algorithm, semismooth Newton coordinate descent (SNCD), for the 
elastic-net penalized Hnber loss regression and qnantile regression in high dimensional 
settings. Unlike existing coordinate descent type algorithms, the SNCD npdates each 
regression coefficient and its corresponding snbgradient simnltaneonsly in each itera¬ 
tion. It combines the strengths of the coordinate descent and the semismooth Newton 
algorithm, and effectively solves the compntational challenges posed by dimensional¬ 
ity and nonsmoothness. We establish the convergence properties of the algorithm. 

In addition, we present an adaptive version of the “strong rnie" for screening predic¬ 
tors to gain extra efficiency. Throngh nnmerical experiments, we demonstrate that 
the proposed algorithm is very efficient and scalable to nltra-high dimensions. We 
illnstrate the application via a real data example. 

Keywords: High-dimensional regression; Nonsmooth optimization; Elastic-net; Newton 
derivatives; Solution path; Subgradient updating. 


1 INTRODUCTION 

Consider the linear regression model 

Vi = + xj I3 + ei 

where Xi is a p-dimensional vector of covariates, {/3o, P) are regression coefficients, and Ei 
is the random error. We are interested in the high dimensional case where n and the 
model is sparse in the sense that only a small proportion of the coefficients are nonzero. In 
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such a scenario, a key task is identifying and estimating the nonzero coefficients. A popular 
approach is the penalized regression 


min-V£(|/i-/3o-x7/3) + AP(/3), ( 1 . 1 ) 

I3q,P n 

i 

where £ is a generic loss function and p is a penalty function with a tuning parameter A > 0. 
We consider the elastic-net penalty (Zou and Hastie, 2005) 

P{(3) = Pa{(3) = a||/?||i + (1 - a)^||/5||2,0 < a < 1, 


which is a convex combination of the lasso (Tibshirani, 1996) {a = 1) and the ridge penalty 

{a = 0 ). 

A common choice for £ is the squared loss £{t) = t^/2, corresponding to the least squares 
regression in classical regression literature. Although the squared loss is analytically simple, 
it is not suitable for data in the presence of outliers or heterogeneity. Instead, we could 
consider two widely used robust alternatives, the Huber loss (Huber, 1973) and the quantile 
loss (Koenker and Bassett Jr, 1978). 

The Huber loss is 


£{t) = h-y{t) 


p 

27’ 


if |f| < 7, 


( 1 . 2 ) 


if|f|>7, 

where 7 > 0 is a given constant. This function is quadratic for |t| < 7 and linear for |t| > 7 . 
In addition, it is convex and first-order differentiable. These features allow it to combine 
analytical tractability of the squared loss for the least squares and outlier-robustness of the 
absolute loss for the LAD regression. 

The quantile loss is 


£(t) = Prit) = tij — I(t < 0)),t G M, (1.3) 

where 0 < r < 1. This is a generalization of the absolute loss with r = 1/2. Rather 
than the conditional mean of the response given the covariates, quantile regression models 
conditional quantiles. For heterogeneous data, the functional relationship between the 
response and the covariates may vary in different segments of its conditional distribution. 
By choosing different r, quantile regression provides a powerful technique for exploring 
data heterogeneity in addition to outlier-robustness. 
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The theoretical properties of these two regression models have been systematically 
studied, yet there are relatively few researches on the algorithmic aspect, especially the 
penalized versions for high-dimensional data. Holland and Welsch (1977) proposed an it¬ 
eratively re-weighted least squares algorithm for the unpenalized Huber loss regression. 
However, this algorithm does not have a natural extension to the penalized version. For 
unpenalized quantile regression, Portnoy et ah (1997) formulated its dual form as a lin¬ 
ear programming problem and proposed an interior point method to solve it. The lasso 
penalized version can be shown to have a similar dual form, except that it becomes 
(n -|- p)-dimensional with p extra constraints due to the penalty. Thus it can be solved 
using the same algorithm and this extension was implemented in the R package quantreg 
(http://cloud.r-project. org/package=quantreg). However, it is not clear if this ap¬ 
proach is scalable to high-dimensional problems. Osborne and Turlach (2011) proposed a 
homotopy algorithm for computing solution paths of lasso penalized quantile regression, 
where the lasso penalty was formulated as a constraint l/^il — which is not directly 

comparable with the unconstrained formulation considered here. 

In recent years coordinate descent algorithms have proven to be very effective for path- 
wise optimization of penalized regression models, see for example, Friedman et ah (2007) 
for lasso and fused lasso penalized least squares, Friedman et ah (2010) for elastic-net pe¬ 
nalized GLM, and Breheny and Huang (2011) for nonconvex penalized least squares and 
logistic regression. The loss functions considered by these authors are either quadratic, or 
twice differentiable which can be approximated quadratically via Taylor expansion. Hence 
the coordinate descent iterations have close-form solutions. However, the Huber loss is 
only hrst-order differentiable and the quantile loss is nondifferentiable, hence the above 
approach does not work. Wu and Lange (2008) proposed a coordinate descent algorithm 
for lasso penalized LAD regression that amounts to computing a weighted median at each 
iteration, but did not provide any guarantee for convergence. Recently Peng and Wang 
(2015) proposed a QICD algorithm for nonconvex penalized quantile regression that ma¬ 
jorizes the penalty functions by weighted lasso and then solves the problem with coordinate 
descent. The authors proved convergence of QICD to a stationary point, for which the ma- 
jorization step plays a critical role. But when the lasso penalty is used, which does not 
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need to be majorized, the algorithm becomes the same as the one in Wu and Lange (2008). 
In addition, it appears that neither algorithm can be easily generalized to the elastic-net 
penalty with 0 < a < 1. 

In this paper, we propose a novel semismooth Newton coordinate descent (SNCD) 
algorithm for computing solution paths of the elastic-net penalized Huber loss regression 
and quantile regression. This algorithm combines the coordinate descent algorithm with 
the semismooth Newton algorithm (SNA) for solving nonsmooth equations. It is highly 
efficient and scalable in high-dimensional settings. Unlike a typical coordinate descent 
method which only updates the primal variable /?, the SNCD utilizes both the primal 
and the dual information (via subgradient) in its iterations. In addition, an adaptive 
version of the strong rule (Tibshirani et ah, 2012) for screening predictors is incorporated 
to gain extra efficiency. We also provide an implementation of SNCD through a publicly 
available R package hqreg (https://cran.r-project.org/web/packages/hqreg/index. 
html) which currently supports the Huber loss, the quantile loss and the squared loss. This 
algorithm can be generalized to other problems with nonsmooth loss functions, like the 
linear support vector machine with the hinge loss. 

The rest of this paper is organized as follows. In section 2 we introduce SNCD for 
the penalized Huber loss regression and establish its convergence. In section 3 we extend 
SNCD to penalized quantile regression. Section 4 describes the adaptive strong rule. In 
section 5 we investigate the performance of hqreg, our implementation of SNCD, through 
simulation studies and real datasets. 

2 SNCD FOR PENALIZED HUBER LOSS REGRESSION 
2.1 Background on Newton Derivatives and SNA 

Based on the concepts of generalized Jacobian (Clarke, 1983) and semismoothness (Mifflin, 
1977), Qi and Sun (1993) established superlinear convergence of a Newton-type method 
for solving hnite-dimensional nonsmooth equations, hence the name Semismooth Newton 
Algorithm (SNA). The Newton differentiability was introduced later for more general prob¬ 
lems including inhnite-dimensional cases (Chen et al., 2000; Ito and Kunisch, 2008). It has 
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a simpler formulation and is actually a milder condition than semismoothness. Newton 
derivatives can be calculated via basic algebra and chain rules as indicated in Lemmas A.l, 
A .2 and A.3 in Appendix A. 

Definition 2.1. A function F : —)■ M} is said to be Newton differentiable at z G MF 

if there exists an open neighborhood A/'(z) and a mapping H : N'{z) —)■ such that 

{H{z + h) : z + h G M{z), h 7 ^ 0} is uniformly bounded in spectral norm induced by the 
Euclidean norm and 

\\F{z + h) - F{z) - H{z + h)h \\2 = 0 ( 11 ^ 112 ) as h 0. 

Here FI is called a Newton derivative for F at z. The set of all Newton derivatives at z is 
denoted as VnF{z). 

A function F : —)■ W is said to be locally Lipschitz continuous at z if there exists 

L{z) > 0 such that for all sufficiently small h, 

\\F{z + h) - F{z)h < L\\hh. 

Then F is Newton differentiable at 2 ; if and only if F is locally Lipschitz continuous at 2 ; 
(Chen et ah, 2000). This gives a simple characterization of the Newton differentiability. 
The following result due to Chen et ah (2000) establishes the superlinear convergence of 
SNA under the Newton differentiability. 

Theorem 2.1. Suppose that F : M™' —)■ M™' is Newton differentiable at a solution z* of 
F{z) = 0. Let H be a Newton derivative for F at z*. Suppose there exists a neighborhood 
^{z*) and M > 0 such that H{z) is nonsingular and ||iL(z)“^|| < M for all z G Af{z*), 
then the Newton-type iteration 

ff+i h{z'^)-^F{z^), A; = 0,1,... 

converges superlinearly to z* provided that \\z^ — z *\\2 is sufficiently small, where z^ is the 
initial value. 
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2.2 Algorithm 


2.2.1 Description 

Consider the Huber loss £ = then (1.1) becomes 

min/^^(/3o,/3) = - ^h^{yi - I3 q - xll3) + \Pa{l3). (2.1) 

/3o,/3 n 

I 

Fix A and a, and denote the optimizer by (/3o,/3). Since the objective function in 
(2.1) is convex, (/3o,/d) satishes the necessary and sufficient Karush-Kuhn-Tucker (KKT) 
conditions. Let d\t\ denote the set of subgradients of the absolute value function | ■ | at f, 
then it can be shown that 


s G d\t\ if and only if f = S{t + s), (2.2) 

where S is the soft-thresholding operator with threshold 1, i.e. S{z) = sgn(^)(|; 2 | — 1 ) + . 
As shown in Appendix A, combining this fact with some other convex analysis concepts 
(Rockafellar, 1970; Combettes and Wajs, 2005), the KKT conditions of (2.1) can be written 
as 

- ^0 - xj^) = 0 , 

' Ei - 3o - xj^)xij + Xa?j + A(1 - a)Pj = 0, (2.3) 

- S0j -I- Sj) = 0 , j = 1 ,... ,p, 
where 'Sj G d\l3j\ and the derivative of is given by 

,, , , [y, if \t\ < 7, 

Kit) = I " 

[sgn(f), if \t\ > 7 . 

In this way the optimization problem (2.1) is transformed into a root hnding problem 
for a system of nonsmooth equations (2.3). A straightforward approach is applying SNA 
to the entire system of equations. As discussed later in section 2.3, this approach contains 
many matrix operations that cause O(np^) computational cost per iteration, which severely 
limits its scalability. 

For better efficiency and scalability, we propose a new algorithm. Semismooth Newton 
Coordinate Descent (SNCD), that combines SNA with cyclic coordinate descent in solving 
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these equations. Similar to the Gauss-Seidel method for linear equations, SNCD solves the 
equations of (2.3) in a cyclic fashion to avoid cumbersome matrix operations. We cycle 
through {(3o,(3,s) in a pairwise fashion: at each step, a pair {Pj,Sj) (and Pq by itself) is 
updated by solving the corresponding part of (2.3), while the other variables are hxed at 
their current values (3k, Sk, k ^ j. Specihcally, we solve the following equations at each step; 


• For 


+ -^(1 - = 0 ) 

(3j - S{(3j + Sj) = 0, 


(2.4) 


• For (3o: 



(2.5) 


where Vi = yi - Po - xj P, r ^ 1,..., n. 

Note that (2.4) is exactly the KKT conditions of 




and (2.5) the KKT condition of 


mmfH{Po,/3i, ...). 

PO 


Hence SNCD can be seen as a special type of coordinate descent. 
Denote 


^ (t) = -/(|t| < 7), 
7 


( 2 . 6 ) 


then G VArh(^(f),Vf G M. The SNCD iterations proceed as follows: 


(i) Updating (3q. Let 



Since 



we update (3q by solving (2.5) via SNA 



(ii) Updating (/?j, Sj). Let 



+ ^ijPj - Xij^l)Xij + ^(^^2 + A(1 - a)zi 

Zi — S{zi + Z2) 


-Z2 +?>gn{zi +Z2) ii\zi + Z 2 \>l, 
zi if \zi + ^ 2 ! < 1, 

we solve for {(3j, Sj) from (2.4) via SNA in two types of npdates: 

(a) \(3j + Sj| > 1. For z with \zi + Z 2 \ > 1, a Newton derivative of Fj at z is 


(2.7) 


H,iz) = 


^ Y.i V’ 7 (u + Xijf3j - XijZi)x‘f. + A(1 - a) Xa 


e VNFj{z). ( 2 . 8 ) 


Hence the npdate is 


-1 

_1 


Aj 

-Hj(Aj,Sj) ^Fj(Aj,Sj) = 

0 1 ^T,ih'^ifi)x:ij-\asgn(l3j+Sj)-X[l-a)i3j 

. . 


A' 


sgn(/3j + Sj) 


(b) \l3j + Sjl < 1. For z with \zi + Z 2 \ < 1, a Newton derivative of Fj at z is 


= 


k AA'^i + + A(1 - a) Xa, 


e VNFj(z). (2.9) 


Hence the npdate is 






ft 


--ffftft,ftr‘ft(ft,ft) = 


Xa 


2 . 2.2 Convergence 

Since SNCD hts in the general coordinate descent framework, its convergence property 
follows from the convergence results for coordinate descent (Tseng, 2001). To apply the 
results, we hrst show that the optimization problem is of the form 

m 

min f{zi, ...,Zm) = /o(a, ...,z^) + ^ fjAj), 

i=i 
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where /o, /i,..., /m are convex, /q is first-order differentiable and the level set {z : f{z) < 
f{z^)} is bounded given any initial point z^. A key fact to notice about this formulation 
is that the nondifferentiable part Ylj ^lust be separable. The penalized Huber loss 

regression model in ( 2 . 1 ) clearly satisfies these conditions. 

At each coordinate update, SNA is applied to solve the equations, which requires non¬ 
singularity of the Newton derivative and the uniform boundedness of its inverse. When 
updating (3o, these requirements are met if | ~ /^o ~ ^ 7/^)1 is bounded away from 

0. This is true as long as there is at least one observation with \yi — — xjI3\ < 7 . When 

updating l3j, Sj, it can be shown via some algebra that a sufficient condition is 0 < a < 1 
and V ’7 is bounded. The latter always holds since 'ip'y{t) G {0, 1 / 7 } for any t. 

In order for this local SNA strategy to work well, we also need the starting point and 
the optimal point in each coordinate update to be sufficiently close. Denote the globally 
initial fn value by ffj. Since fn decreases along SNCD iterations and the level set {(/do, (3) : 
fH{Po,P) < fn} is bounded, the closeness requirement is satisfied if the diameter of the 
set is sufficiently small. 

The above discussions are summarized in the following result. 

Theorem 2.2. For problem (2.1), /et A > 0, a G (0,1) and the initial fn value be f%. 
Assume for every point {(3o, (3) in the level set C = (3) : fH{(3o,/3) < ffj} there exists 

i G {1, .. . ,n} such that \yi — (3o — xj f3\ < 7 . Then SNCD iterations converge to a global 
minimizer provided that the diameter of C is sufficiently small. 

2.2.3 Pathwise Optimization 

To actually implement the algorithm, we still need to consider an important issue: its 
convergence relies on a good initial point, which is usually not guaranteed in practice. 
For low-dimensional problems we can use line search to ensure global convergence with 
an arbitrary initial point, but since line search methods involve considerable amounts of 
function and gradient evaluations, they are not well-suited for high-dimensional cases. 

The strategy of pathwise optimization with warm start can help globalize the conver¬ 
gence of the algorithm. With a decreasing sequence of A values, this strategy sequentially 
solves the optimization problem at each A^ using the optimizer at the previous Afc_i as 
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the initial value. When A^-i, Xk are reasonably close, the initial point {/3o{Xk-i), /3{Xk-i)) 
will be near the optimizer {/3o{Xk),/3{Xk)) as well. Hence each optimization problem along 
the path is warm-started with a good initial point, and fast convergence can be achieved. 
This strategy generates a solution path, which in turn will be useful for tuning parameter 
selection. 


2.3 Comparison with SNA and Existing Coordinate De¬ 
scent Type Algorithms 


2.3.1 SNA for Penalized Huber Loss Regression and Its Computational Bot¬ 
tleneck 


Denote S{z) = {S{zi),... ,S{zp)y and d{yo,/3) = {h'yyi - /3o - xj ^),..., hyi/n - /3o - 
x^/3))~^, then the KKT conditions (2.3) can be written compactly as 


F{f3o,(3,s) = 


-ACd(/3o,/?) 

— Xx^d{yo, fd) + Xas + A(1 — a)/3 
/d-S(/d + s) 


= 0 . 


( 2 . 10 ) 


It is easy to verify F is Newton differentiable, then SNA can be directly applied here 
for solving F(/3o, /d, s) = 0. See Appendix B for details. 

In terms of computational cost, the hrst concern is about matrix inversion, since the 
Newton derivative of F is a (l-|-2p) x (l-|-2p) matrix, for which inversion becomes intractable 
when p is large. However, the decomposition (2.7) leads to an “active set strategy” that 
helps reduce the dimension. Given the /cth iteration (/1 q, s^), dehne the active set Ak 
and its complement Bk hy 


Ak = {j : + sjl > 1} and Bk = {j : < 1}. (2.11) 

Then the Newton-type iteration of SNA is decomposed into two parts Ak and Bk and only 
the computation of requires inverting a matrix, the dimension of which is only 

(1-|- |Afc|) X (1-|- |Afc|). In general, \Ak\ can be as large as p. But since pathwise optimization 
is implemented, the algorithm is warm-started at each A value. Hence Ak is usually not 
too much different from the support of the optimizer, which tends to be a sparse subset of 
{l,...,p}. 
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The real bottleneck is in matrix mnltiplication. Let be as in (2.6). Let X* = (1„ X) 
andTfc = Then as shown in Appendix B, 

each iteration includes re-computing and re-partitioning X*~^d/kX*, which involves 0{np^) 
arithmetic operations that become formidable for large p. The diagonality of and the 
symmetry of could be utilized to reduce computation, but the magnitude remains 

0{np^). Since ~ ~ ^ caching all the (1-l-p) x (l-|-p) 

matrices x*x*^ would also speed up the computation, but since there are n such matrices, 
such an implementation would be memory-inefficient. 

2.3.2 SNCD vs. SNA 

The two algorithms mainly differ in the following aspects: 

• Consider a full update on {(3q,(3,s) as one iteration. The computational cost per 
iteration of SNCD is 0{np), compared with 0{np^) for SNA. 

• The SNCD iterations consist of univariate and bivariate updates only while SNA 
involves matrix inversions. 

• While SNA has locally superlinear convergence rate in theory, SNCD is at most 
linear. It is a worthwhile compromise, however, considering that SNCD reduces the 
computational cost per iteration from 0{np^) to 0{np) and that warm-starting due 
to pathwise optimization strategy allows SNCD to converge quickly. 

• In practice, SNCD is much faster; and SNCD always converges while SNA diverges 
in some high-dimensional cases even when pathwise optimization is used. 

2.3.3 SNCD vs. Existing Coordinate Descent Type Algorithms 

SNCD also differs from the existing coordinate descent algorithms for penalized regression 
(Friedman et ah, 2007; Friedman et al., 2010; Simon et ah, 2011; Breheny and Huang, 
2011) in the following aspects: 

• It generalizes coordinate descent to work on a wider class of models where the loss 
functions, like the Huber loss, only need to be hrst-order differentiable. As shown in 
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the next section, it is also extended to a case with a nondifferentiable loss, i.e. the 
qnantile loss, via smoothing approximation. 

• It is directly motivated from the KKT conditions as a root-hnding method, where 
the subgradients s/s are treated as independent variables that are connected with 
/dj’s through the equation — S{l3j + Sj) = 0 . 

• Each pair of {Pj,Sj) is updated simultaneously with different formulas for two situ¬ 
ations \Pj + Sjl > 1 and |/3j + Sj\ < 1. This is quite different from the coordinate 
descent algorithms mentioned above that only update the coefficients /3j’s. 

3 SNCD FOR PENALIZED QUANTILE REGRESSION 
3.1 Description 

For the quantile loss function i = p^-, (1.1) becomes 

min/Q(/3o,/3) = priVi - f3o - xj(3) + XPa{f3). (3.1) 

/3o,/3 n ^ 

I 

SNCD cannot be directly applied to this problem since it requires the hrst-order deriva¬ 
tives of the loss function, but p-^ is not differentiable. However, note that 

Pr{t) = (1 - r)t_ + r4 = ^ {\t\ + (2r - l)t} . 

Since h^{t) —)■ |t| as 7 —)■ 0+, Prit) ~ | {h-y{t) + (2r — l)f} for small 7 and the solutions to 
penalized quantile regression can be approximated by 

min fHA{/3o,f3) = - (3o - xjf3) + {2 t - l){yi -/3o - xj(3)} + XPa{(3), (3.2) 

where “HA" stands for Huber approximation. This problem is easier to handle since its 
loss function is hrst-order differentiable. The following result provides theoretical support 
for this smoothing approximation. 

Theorem 3.1. Given any A>0, 0<r<l and { 7 fc} converging to 0, let {/3ok,/3k) be. 
a minimizer 0 ////^(/lo,/d; A, r, 7 ^,). Then every cluster point of sequence {(/dofc)/^fc)} is a 
minimizer of fqifdo, (3] A, r). 
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Now we can derive the KKT conditions and apply SNCD to solve (3.2). Due to its 
similarity to the penalized Huber loss regression, we omit the details. At each iteration, 
with the current estimates denoted by (/3o, /3, s) and residuals by r,, the SNCD updates are 


(i) For l3o: 


(ii) For 


/^o t— /3o + 


Ei {h;(A) + 2r- 1} 


(a) If |/3j + Sjl > 1, then 

^ Ei + 2r - 1} Xij - Xasgn0j + Sj) - A(1 - a)^j 


(^j + 

Sj ^ sgn0j + §j). 


^Ei^7(D)4 + ^(i -«) 


(b) If \/3j + Sj\ < 1, then 
f5j ■<— 0 , 

, ^ E* { ^7(^*) + 2^ - 1} + Pj-h Ei ^7 (d)^p 

A« 

The previous discussions on convergence and pathwise optimization also apply here. 
And similar to Theorem 2.2, we have the following result. 

Theorem 3.2. For problem (3.1), /ef A > 0, a G (0,1) and the initial fnA value be Jha- 
Assume for every point (/Iq, P) in the level set C = {(/lo, P) '■ fnAiPo, P) < Jha} there exists 
i G {1,.. ., n} such that \yi — (3 q — xj/3\ <7 . Then SNCD iterations converge to a global 
minimizer provided that the diameter of C is sufficiently small. 


3.2 The Choice of 7 Values 

For the approximation to work well, we need to use a sufficiently small 7 ; but when 7 
gets too close to 0, the algorithm becomes ill-conditioned. Therefore we designed a data- 
dependent heuristic method for picking appropriate 7 values. At each A^, we determine 
7 fc depending on the residuals ffs given by the previous optimizer (/lo(Afc-i),/9(Afc_i)) as 
follows. 
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i. Initialize residuals ri i/i; 


ii. For each A^: 

(a) 'jk t— 10 -th percentile of 

(b) 7 fc ^ niin{ 7 fc, 7 fc_i}; 

(c) 7 fc ^ niax{ 7 fc, 0 . 001 }; 

(d) solve the problem with 7 ^., A^ and update r^’s at each iteration. 

In step (a) we pick a value smaller than the magnitudes of 90% of all residuals for 
which the loss function is the same as the quantile loss so the approximation should work 
well. This also keeps 7 ^ above the magnitudes of 10% of the residuals, which ensures the 
numerical stability of the algorithm. Bracketing in (b) and (c) are additional safeguards 
for stability. 

3.3 Related Convergence Results 

The key to the smoothing approximation is the fact that h^{t) converges to |f| as 7 tends 
to 0. In fact, it is also easy to see that with 7 as a scaling factor, converges to 

the squared loss ^ when 7 goes to inhnity. Hence, in the same spirit of Theorem 3.1, we 
also show the connections between the penalized Huber loss regression and two important 
regression models with respectively the absolute loss and the squared loss, i.e. the Least 
Absolute Deviations (LAD) and the Least Squares (LS). 

To simplify the notation, let 6 = (/do, /3) and P(-) be a general penalty function. Denote 

min/H(0;A,7) = ^J2ihyiyi-(^o-xJ/3) + XP{(3), 

u 

min/A(0;A) = 7 Ei I?/* “/^o - + AP(/?), 

u 

min/s(0;A) = - Po - x] + \P{P). 

u 

Then we have /a( 6 ';A) as 7 -)■ 0; fs{0;X) as 7 cx). 

And the following results establish the convergence between their optimizers. 
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Theorem 3.3. Given any A > 0 and { 7 ^} converging to 0, let 9^ he a minimizer of 
fH{9-,\,'jk)- Then every cluster point of seguence {9k} is a minimizer of fA{9]\). 

Theorem 3.4. Given any A > 0 and { 7 ^} converging to 00 , let 9k he a minimizer of 
A/ 7 fc, 7 fc). Then every cluster point of seguence {9k} is a minimizer of fs{9]\). 

Therefore, the penalized Huber loss regression bridges the gap between LAD and LS 
regression as 7 varies from 0 to 00 . The solutions of the penalized Huber loss regression 
constitute a rich spectrum from the solution of LAD regression to that of LS regression. 
This property gives us more flexibility in fitting high-dimensional regression models. 

4 ADAPTIVE STRONG RULE FOR SCREENING 

PREDICTORS 

Tibshirani et ah (2012) proposed the (sequential) strong rule for screening out predictors in 
pathwise optimization of penalized regression models for computational efficiency. However, 
when applied to the penalized Huber loss regression and quantile regression, we discover 
that the strong rule suffers from the issue of “violations" that is explained below. To deal 
with this issue and enhance algorithmic stability, we develop an adaptive version of the 
strong rule. 

We first describe the strong rule. Consider a general elastic-net penalized regression 

1 " 

min - - (3q- x] jd) -F AP„(/1). 

i=\ 

where t is convex and differentiable. Then the optimizer (/?o(-^) 5 /S(-^)) satisfies the KKT 
conditions 

-7Ei = 0, 

' -7 Ei - 3o - xj]3)xij + XaSj + A(1 - a)^j = 0, 

Jjed\Pj\, j = l,...,p. 

The unpenalized intercept /3o is always in the model, so there is no screening rule for it. 
For (dj, let Cj{X) = — A ^-£'(?/* — {do — xj(d)xij. Assume each Cj is a-Lipschitz continuous, 

|cj(A) — Cj(A')| < a|A — A'l, for every A, A' > 0. (4.1) 
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Then at each new in the solution path, given the previous optimizer /3{Xk-i)) 

and the corresponding Cj(Afc_i)’s, the strong rule discards predictor j if 

|cj(Afc_i)| < a(2Afc — Afc_i). (4.2) 

The reasoning is as follows. Assume (4.1) and (4.2) hold, since Xk-i > Xk, we have 

|cjr(Afc)| ^ l^i('^fc) Cj(Afc_i)| T |cjf(Afc_i)| 

< a{Xk-i — Xk) + «(2Afc — Afc-i) 

= «Afc. 

It follows that (3j{Xk) = 0, since by contradiction Pj{Xk) ^ 0 implies 'Sj{Xk) = sgn{(3j{Xk)) 
thus |cj(Afc)| = Afctt + Afc(l - a)\Pj{Xk)\ > Xka. 

The effectiveness of the strong rule relies on the assumption (4.1), which does not 
necessarily hold. So application of the rule should always be accompanied with a check 
of the KKT conditions. A pathwise optimization algorithm incorporating the strong rule 
proceeds as follows. 

For each Xk, 

(a) Compute the eligible set E = {j : |cj(Afc_i)| > a{2Xk — Afc_i)}; 

(b) Solve the problem using only the predictors in E] 

(c) Check KKT conditions on the solution; |cj(Afc)| < aXk for j G E^. We are done if there 
are no violations; otherwise, add violating indices to E and repeat (b) and (c). 

For the penalized least squares and logistic regression we have not encountered any 
violation, but it a different story for the penalized Huber loss regression and quantile 
regression. Using the strong rule for these two models, we often encounter a large number 
of violations, indicating that the rule may have been too restrictive. Since the algorithm is 
re-run each time violations are found, the overall efficiency is affected. Thus reducing the 
number of violations can enhance the algorithmic stability and lead to potential speedup. 

A simple approach is to use a multiplier M > 1 and relax the assumption (4.1) to the 
following: VA, A' > 0, 

|ci(^) -C(^')l < aM|A-A'|. 
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Accordingly, we will need to change (4.2) to 


|cj(AA;_i)| < a (Afc + M(Afc — Xk-i)) ■ 

However, this strategy does not work well in practice, since it is difficult to pre-deterniine 
an appropriate value of M that suits all values of A in the solution path. 

Hence we propose an “adaptive" version that allows M to vary with A. This rule 
automatically estimates a localized M(A) that varies and adapts to the trends of the solution 
paths, which reduces the number of violations by a large margin without sacrihcing speed. 
The idea is as follows. 

Let M(Ao) = 1. Then at each A^, 

(a) use M(Afc-i) to construct the eligible set, i.e. let 

^ — {j '■ |cj(Afc_i)| > a (Afc + M(Afc_i)(Afc — Afc-i))}; 


(b) solve the problem using only the predictors in and check KKT conditions as before; 
update E and repeat step (b) if violations occur; 


(c) compute M{Xk) based on the local trend of c/s: 


M(Afe) 


max|cj(Afc_i) - Cj(Afc)| 
i<i<p 


a(Afc-i — Afc) 


5 NUMERICAL RESULTS 


5.1 Optimization Performance for Penalized Quantile Re¬ 
gression 

As mentioned in the introduction, quantreg is another publicly available R package that 
supports lasso penalized quantile regression. Since our implementation employs an approx¬ 
imation model, it does not give “exact" solutions. Hence we want to compare its solutions 
with the ones computed by quantreg in terms of optimality. 

Unlike hqreg that computes a solution path, quantreg computes a single solution for 
a given A value, and it does not support the general elastic-net penalty with 0 < a < 1. 
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For comparison, we only consider lasso (a = 1). We first computed a solution path along 
100 A values using hqreg and then ran quantreg for each A value. Note that quantreg 
actually uses the formulation 

n ^ p 

- /^o - xll3) + A ■ - V |/3j| 

P0,P / ■'“ 

1=1 ]=l 


which does not have a 1/n scaling factor for the loss part and instead contains a 1/2 factor 
for the penalty. This is intended to treat the penalty terms as if median regression were 
performed on them (|A|/3j| = Due to this difference, for each A value used 

with hqreg, we equivalently supplied quantreg with 2nA. Also, while hqreg supports 
data preprocessing via the argument “preprocess" with 3 options “standardize", “rescale" 
and “none", quantreg does not provide such an option. So we standardized the data 
beforehand for all the real datasets involved in this section and used the standardized ones 
for comparison. Consequently, we set preprocess = "none" when calling hqreg. For 
quantreg, the latest version 5.24 was used. 

Let /( 3 (-;A) denote the objective function as in (3.1), and let /dhqreg and /^quantreg he 
the solutions given by the two packages, respectively. For a = 1 the model is not strictly 
convex, so in general it does not have a unique optimizer. Hence the values of the two 
solutions may not be very close. Instead, a reasonable approach is to compare the values of 
the objective functions /Q(/5hqreg) and fg{(3quantreg) ■ Specihcally, we made the comparisons 
based on the relative difference. 


D{X) 


fgiPhqreg^ A) /q(/ dquantreg) A) 

/q(/^ quantreg) A) 


(5.1) 


Two datasets were considered: 


• GDP (Koenker and Machado, 1999): consists of 161 observations on national GDP 
growth rates, recorded as “Annual Ghange Per Gapita GDP", and 13 covariates. The 
hrst 71 observations are from the period 1965-1975, and the rest from the period 
1975-1985. This dataset is available in quantreg via data(barro). 

• Riboflavin (Biihlmann et ah, 2014): gene-expression data for predicting log trans¬ 
formed riboflavin (vitamin B2) production rate in Bacillus subtilis. It contains 71 
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observations and 4088 features (gene expressions). This dataset is available in R 
package hdi via data(ribof lavin). For this task only 1000 features with the largest 
variances were used. 


GDP Riboflavin 



- quantreg - hqreg 

Figure 1: Values of objective functions with r = 0.5 along the solution path for GDP and 
riboflavin datasets. Solid line: quantreg, dashed line: hqreg. 


Dataset 

T 

minD(Aj) 

maxiA(Ai) 


0.25 

-2.1e-9 

1.5e-3 

GDP 

0.50 

-1.3e-10 

9.6e-4 


0.75 

-3.0e-10 

1.7e-3 


0.25 

6.5e-5 

2.6e-2 

Riboflavin 

0.50 

-3.6e-10 

2.0e-2 


0.75 

8.8e-5 

2.1e-2 


Table 1: The range of the relative differences D(Ai),l < i < 100, between hqreg and 
quantreg. 


Figure 1 displays the computed values of objective functions fQiPhqreg) and /^(/dquantreg) 
for r = 0.5 along the solution path for both datasets. There is no visually detectable 
discrepancy between the two lines. Hence we also computed the range of D{X) in each case 
and the results are listed in Table 1. In each case, the range of D(Ai)’s is extremely narrow 
and all values are very close to zero. This indicates the two packages indeed have similar 
performances. 
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Dataset 

T 

hqreg 

quantreg 

total Ai Aioo 


0.25 

0.018 

0.235 

0.007 

0.002 

GDP 

0.50 

0.018 

0.223 

0.002 

0.003 


0.75 

0.027 

0.240 

0.003 

0.002 


0.25 

2.501 

538.2 

3.630 

4.958 

Riboflavin 

0.50 

3.026 

531.6 

4.591 

4.984 


0.75 

2.922 

588.8 

7.119 

5.791 


Table 2: Running time (in seconds) for computing the solution paths 


We also report the running time in Table 2. The time for hqreg is for one call that hts 
the entire solution path, and the time for quantreg is the total of time recorded separately 
for each A. For all these cases hqreg is signihcantly faster than quantreg, although it may 
not be quite fair for quantreg since it does not rely on warm-start. The timings taken for 
quantreg on Ai and Aioo are also listed, which appear to be roughly the same. In the case 
of riboflavin data, the running time of quantreg on single A values is in fact longer than 
the time used by hqreg to compute the whole path. 

To further investigate their performances in various other scenarios, we ran a large set 
of experiments on 10000 datasets, each generated with the following settings: 

• the number of observations n and the number of features p are randomly selected 
from the set {20,100, 200, 500,1000, 2000, 5000}. 

• the number of nonzero coefficients is g = 0min(n,p) where 9 is uniformly sampled 
from (5%, 10%, 20%, 30%} and the coefficients values are randomly selected from 
{±1 ,...,± 10}. 

• each feature vector Xi is generated via Xij = Zij + O.buj, 1 < j < p, where Zij,UiS are 
i.i.d. standard gaussian, so that each pair of features has the same correlation 0.25. 

• the outcome pds are generated by pj = IQ + xj(5 + Si, where SiS are iid sampled from 
Student’s t distribution with df = 4. 

For each dataset and each r G (0.25, 0.5, 0.75}, we applied hqreg to compute an entire 
solution path and randomly selected an index k out of (10, 20 ,..., 100}, then ran quantreg 
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on Afc, the k-th A value for the solution path computed by hqreg. These experiments were 
performed in parallel via grid computing on a high performance cluster at the University 
of Iowa. 


O 

O 

b 

o 

> 

(0 in 

§ i 


Figure 2: Boxplots of the relative difference D on 10000 simulated datasets 

We calculate the relative difference D{X) for each pair of the solutions and summarize 
the results in three boxplots plotted on the logarithmic scale shown in Figure 2. We observe 
that the values of D have a narrow range between le-7 and le-1 with the majority falling 
below le-3, and are slightly smaller for r = 0.5. Besides, the distribution of D appears 
roughly symmetric on the logarithmic scale in each case. 

5.2 Timing Performance 

In addition to the Huber loss and the quantile loss, hqreg also supports the squared loss 
for the least squares which is not discussed in this paper, but its SNCD iterations can be 
derived in a similar way as the other two models. Here we consider their running time 
performances. 

We generated Gaussian data with n observations and p features, where each pair of 
features have an identical correlation p. To simplify settings and highlight the timing 
comparison based on the key parameter 7 and r, we set p = 0.25 and a = 0.9 for all cases. 
The responses were generated by 

Y = Y^Xfii + k-E 

j 



0.25 0.5 0.75 

T 
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where jSj = (—1)-^ exp(—(j — 1)/10), E ~ T{df = 4) and k is determined so that the 
signal-to-noise ratio is 3. 

5.2.1 Huber Loss Regression and Least Squares 

In this part, we compare the rnnning time of competing methods for the elastic-net penal¬ 
ized Hnber loss regression and least sqnares. For the Hnber loss, since there is no other 
algorithms, we consider only hqreg for SNCD with no variable screening (hqreg-NVS), 
SNCD with the adaptive strong rule (hqreg-ASR), and our implementation of pure SNA. 
In the experiments we considered 5 values of 7 ranging from 0.01 to 100. On the other 
hand, for the least squares we compared hqreg with a state-of-the-art coordinate descent 
algorithm implemented by R package glmnet. For glmnet, the latest version 2.0-5 was 
used which employs the strong rule for variable screening. All methods considered here are 
R functions, glmnet does all its computation in Fortran, hqreg does the computation in C, 
and the SNA implementation is also programmed in C with matrix operations performed 
via BIAS and LAPACK. 

We have found in practice that convergence of SNA has much higher reliance on initial 
points than SNCD, and it can fail if the A sequence is not dense enough. Hence we divided 
the experiments into two parts. In the first part for the Huber loss and the least squares 
together, we left out SNA and computed each solution path with the usual number of 100 A 
values. In the second part, we compared only SNA and SNCD(hqreg-NVS) for the Huber 
loss on dense lambda sequences each consisting of 10000 values. 

Table 3 shows average CPU timings for the first part. First compare the timings for 
the Huber loss. Across different values of 7, we observe that for both versions the timings 
increase when 7 is nearing 0, and stay almost the same for 7 > 1. And clearly hqreg-ASR 
that employs the adaptive strong rule is much faster and more scalable than hqreg-NVS 
that has no variable screening. For the least squares, hqreg-ASR and glmnet have similar 
performances except the case with n = 5000. Besides, we discover that the timings for the 
Huber loss regression with 7 > 1 are very close to those of the least squares. Considering 
that the Huber loss is more difficult to handle than the simple squared loss, the performance 
of hqreg is very impressive. 
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0.01 

0.1 

Huber 

7 

1 

10 

100 

Least Squares 




n = 1000 ,p 

= 100 


hqreg-NVS 

0.61 

0.09 

0.05 

0.04 

0.04 

0.03 

hqreg-ASR 

0.33 

0.06 

0.03 

0.02 

0.02 

0.02 

glmnet 

— 

— 

— 

— 

— 

0.02 




n = 5000,p 

= 100 


hqreg-NVS 

2.46 

0.48 

0.24 

0.20 

0.21 

0.14 

hqreg-ASR 

1.32 

0.30 

0.15 

0.12 

0.11 

0.07 

glmnet 

— 

— 

— 

— 

— 

0.02 




o 

II 

= 1000 


hqreg-NVS 

1.89 

0.39 

0.09 

0.08 

0.08 

0.05 

hqreg-ASR 

0.52 

0.11 

0.03 

0.02 

0.02 

0.02 

glmnet 

— 

— 

— 

— 

— 

0.02 




n = 100 ,p = 

= 5000 


hqreg-NVS 

8.70 

2.09 

0.46 

0.38 

0.38 

0.29 

hqreg-ASR 

0.85 

0.23 

0.09 

0.11 

0.08 

0.07 

glmnet 

— 

— 

— 

— 

— 

0.08 




n = 100 ,p = 

20000 


hqreg-NVS 

30.27 

8.88 

2.43 

2.45 

2.40 

1.23 

hqreg-ASR 

1.60 

0.54 

0.43 

0.31 

0.30 

0.32 

glmnet 

— 

— 

— 

— 

— 

0.30 




n = 100 ,p = 

100000 


hqreg-NVS 

175.81 

45.33 

11.23 

11.49 

11.41 

5.94 

hqreg-ASR 

4.50 

2.12 

1.69 

1.58 

1.53 

1.57 

glmnet 

— 

— 

— 

— 

— 

1.39 


Table 3; Running time (in seconds) for computing regularization paths for the elastic-net 
penalized Huber loss regression and least squares regression. Total time for 100 A values, 
averaged over 3 runs. 
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Table 4 shows average CPU timings for the second part. We observe that while SNCD 
converges in every case, SNA fails in the cases with large p and 7 = 0.1. When p is small, 
SNCD does not have much advantage. But when p increases, SNCD becomes considerably 
faster with an increasing speedup relative to SNA. These results show that SNCD is more 
stable and scalable than SNA. 



0.1 

7 

1 

10 


n = 

1000 ,p = 

= 100 

SNA 

3.98 

5.16 

5.44 

SNCD 

1.89 

1.27 

1.19 


n = 

5000, p = 

= 100 

SNA 

17.98 

24.42 

26.33 

SNCD 

12.38 

6.84 

6.31 


n = 

100 ,p = 

1000 

SNA 

X 

11.70 

10.47 

SNCD 

2.24 

1.62 

1.51 


n = 

100 ,p = 

5000 

SNA 

X 

98.66 

100.76 

SNCD 

9.87 

8.19 

8.96 


Table 4: Running time (in seconds) for comparing SNCD(hqreg-NVS) and SNA on the 
penalized Huber loss regression. “ x" represents early exit due to divergence at some A 
value. Total time for 10000 A values. 


5.2.2 Quantile Regression 

hqreg is faster than quantreg for the examples in section 5.1. However, quantreg does 
not implement pathwise optimization and rely on warm-start like hqreg does. Instead, for 
each supplied A value it has to solve the corresponding problem individually “from scratch". 
So it is not quite reasonable to compare quantreg with hqreg for computing the whole 
solution path. For this part, we compare only hqreg-NVS and hqreg-ASR. As shown in 
Table 5, hqreg-ASR is similar to hqreg-NVS in cases with p = 100 but considerably faster 
when p gets larger. hqreg-ASR also shows much better scalability with the dimension p. 


24 











T 


0.25 0.50 0.75 



n = 

1000,p 

= 100 

hqreg-NVS 

0.21 

0.18 

0.19 

hqreg-ASR 

0.13 

0.10 

0.11 


n = 

5000,p 

= 100 

hqreg-NVS 

0.56 

0.58 

0.54 

hqreg-ASR 

0.38 

0.42 

0.33 


n = 

100,p = 

= 1000 

hqreg-NVS 

10.77 

7.37 

11.90 

hqreg-ASR 

2.98 

1.94 

2.92 


n = 

100,p = 

= 5000 

hqreg-NVS 

47.08 

41.46 

58.97 

hqreg-ASR 

3.33 

2.92 

4.23 


n = 100,p = 

100000 

hqreg-ASR 

19.28 

12.43 

22.98 


Table 5: Running time (in seconds) for computing regularization paths for penalized quan¬ 
tile regression. Total time for 100 A values, averaged over 3 runs. 


5.3 Real Data Example 

We now compare the modelling performance of penalized Huber loss regression, quantile 
regression and least squares via an empirical analysis on a real dataset. It is a breast 
cancer gene expressions dataset that comes from the Cancer Genome Atlas (2012) project 
(http: //cancergenome. nih. gov/), obtained using Agilent mRNA expression microarrays. 
It contains expression measurements of 17814 genes on 536 patients, including BRCAl, the 
first gene identified to be associated with increasing risk of early onset breast cancer. Hence 
we regress the key gene BRCAl on the other genes to detect potential interconnections. 
Before fitting the models, we carried out the following two screening steps: remove any gene 
for which the range of the expression among all patients is less than 2, and remove any 
gene for which the sample correlation with BRCAl is less than 0.05. After the screening, 
there are 11562 genes left. Then we consider 7 elastic-net penalized linear regression models 
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using these genes as predictors: the least squares (LS-Enet); 3 Huber loss regression models 
with values of 7 being IQR(?/), IQR(|/)/2, IQR(?/)/10 respectively where IQR(?/) = 0.93, 
denoted as H-Enet(7 = IQR(?/)), H-Enet(7 = IQR(j/)/2), and (H-Enet(7 = IQR(?/)/10); 
3 quantile regression models with r = 0.25,0.50,0.75, denoted as Q-Enet(r = 0.25), Q- 
Enet(r = 0.50), and Q-Enet(r = 0.75). a = 0.9 is applied to the elastic-net penalty for all 
models. 

We conduct 50 random partitions. For each partition, we randomly select 300 patients as 
the training data and the other 236 as the testing data. A five-fold cross validation is applied 
to the training data to select the tuning parameter A. For prediction on the testing set, we 
consider two error measures. The first one is the commonly used mean absolute prediction 
error (MAPE). Since MAPE is not sensitive to heterogeneity and may not provide accurate 
assessment for Q-Enet(r = 0.25) and Q-Enet(r = 0.75) which use asymmetric losses, we 
also consider using the quantile loss pr to measure prediction performance as suggested 
in Wang et ah (2012). With pj. for corresponding quantile regression models and po.s for 
the least squares and the Huber loss regression models, we define quantile-based prediction 
error (QPE) as Y^iPAVi -yi)/n. 


Model 

LS-Enet 

H-Enet(7 = IQR(i/)) 
H-Enet(7 = IQR(i/)/2) 
H-Enet 7 = IQRfo /lO) 
Q-Enet(r = 0.25) 
Q-Enet(r = 0.50) 
Q-Enet(r = 0.75) 


Ave ^ nonzero 
114.30 (36.99) 
100.14 (44.70) 
82.06 (30.40) 
114.08 (30.73) 
94.58 (41.60) 

152.90 (51.96) 

104.90 (27.96) 


Ave MAPE 
0.335 (0.018) 
0.331 (0.018) 
0.310 (0.020) 
0.293 (0.021) 
0.373 (0.026) 
0.294 (0.021) 
0.317 (0.027) 


Ave QPE 
0.167 (0.009) 
0.166 (0.009) 
0.155 (0.010) 
0.146 (0.010) 
0.151 (0.010) 
0.147 (0.012) 
0.109 (0.007) 


Table 6: Analysis of the microarray dataset 


In Table 6 we report the average number of nonzero regression coefficients, the average 
MAPE and the average QPE, where numbers in the parentheses are the corresponding 
standard errors across the 50 partitions. The standard errors of the estimated numbers of 
nonzero coefficients are large relative to the averages, showing that all models are affected by 
noise to some extent. However, the standard errors for MAPE and QPE are relatively small, 
which indicates the prediction performances are stable. Among all models, H-Enet(7 = 
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IQR(?/)/10) and Q-Enet(r = 0.50) have the best performances in terms of MAPE, and 
Q-Enet(r = 0.75) dominates QPE, while LS-Enet performs poorly nnder both criteria. Q- 
Enet(r = 0.75) seems the best overall and it also tends to select sparser models compared 
to the aforementioned H-Enet (7 = IQR(|/)/10), Q-Enet(r = 0.50) or LS-Enet. 

For each model, different partitions may lead to different selection resnlts. We select 
LS-Enet, H-Enet (7 = IQR(?/)/10) and Q-Enet(r = 0.75) to represent their own classes, 
and report the names and the freqnencies of top genes selected (over 40 times) in Table 
7 where the genes are ordered alphabetically. We observe that some genes snch as DTL, 
NBR2, PSME3, RPL27 have high freqnencies with all three models, while genes snch as 
KHDRBSl do not. Overall, H-Enet (7 = IQR(r/)/10) and Q-Enet(r = 0.75) select more 
genes with high freqnencies than LS-Enet while their model sizes are smaller on average, 
especially Q-Enet(r = 0.75). It indicates these two models more consistently captnre the 
important genes. 


LS-Enet 

H-Enet (7 = 

IQR(r/)/10) 

Q-Enet(r 

= 0.75) 

Gene 

Freqnency 

Gene 

Freqnency 

Gene 

Freqnency 

DTL 

45 

G17orf53 

46 

G17orf53 

48 

KHDRBSl 

41 

GENPQ 

42 

GENPM 

45 

NBR2 

50 

DTL 

46 

DTL 

44 

PSME3 

45 

MGM6 

50 

GGN5L2 

44 

RPL27 

45 

NBRl 

47 

KIAAOlOl 

40 

VPS25 

43 

NBR2 

50 

MGM6 

42 



NMTl 

41 

NBRl 

49 



PSME3 

50 

NBR2 

50 



RPL27 

41 

PSME3 

50 





RPL27 

50 





SUZ12 

40 





SYNGR4 

41 





XRGG2 

41 


Table 7: Genes selected with high freqnency for the microarray dataset 


6 DISCUSSIONS 

The Hnber loss regression and the qnantile regression have important applications in many 
fields. However, there is a lack of efficient algorithms and pnblicly available software that 
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can fit these models in high-dimensional settings. In this paper, we develop an efficient and 
scalable algorithm for compnting the solntion paths for these models with the elastic-net 
penalty. We also provide an implementation via the R package hqreg pnblicly available on 
CRAN (http: //cloud.r-project. org/package=hqreg). 

APPENDICES 

A Background on Convex Analysis and Properties of New¬ 
ton Derivative 

To derive the KKT conditions (2.3), we recall some backgronnd in convex analysis. We 
also describe some nsefnl properties of Newton derivative. 

For a convex fnnction /, a vector w is called a subgradient of f at z if 

f{x) — f{z)>w~^{x — z), \/x. (A.l) 

The set of all snbgradients of / at c is called the subdifferential, denoted as df{z). For 
example, the snbdifferential of the absolnte valne fnnction has the following form 

I {sign(c)} if c 7 ^ 0, 

d\z\ = ^ (A.2) 

[[-1,1] if^ = 0. 

For convex optimization problems, the necessary and snfficient optimality conditions are 
called the KKT conditions. In the case of nnconstrained optimization, the KKT conditions 
can be stated in terms of Fermat’s rnie (Rockafellar, 1970): for a convex fnnction /, 

0 G df{z*) z* = argmin/(^). (A.3) 

Z 

This holds because by definition 0 G df{z*) if and only if f{z) — f{z*) > O'^{z — z*) = 0 
for every z, that is, 2 ;* = argmin/( 2 ;). 

Z 

A more general result (Combettes and Wajs, 2005) is 

w G df{z) z = PToxf{z -f w), (A.4) 

where Prox/ is the proximity operator for / defined as 

Prox/(z) := argmin-||x — z^\ -|- f{x). 

X 2 
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The second statement can be shown as follows. Applying Fermat’s rule, 
z = Prox/( 2 ; + w) = argmin-||x — — w\\l + f{x), 

X 2 

if and only if there exists s G df{x) such that 

0 = {z — z — w) + s = —w + s, 

that is, 

w = s E df{x). 

It can shown that the proximity operator of the absolute value | ■ | is given in closed 
form by the soft-thresholding operator with threshold 1, i.e. 

Prox|.|(z) = ^( 2 ;) = sgn(z)(| 2 ;| - 1)+. (A.5) 

Then it follows from (A.4) that Sj G 9|/3j| can be expressed as an equation 

f5j - S{l3j -t- Sj) = 0. (A.6) 

According to the Fermat’s rule (A.3), the KKT conditions for the penalized Huber loss 
regression (2.1) are 

-I Ei - 3o - x]P) = 0, 

' Ei - A) - xj^)xij + Xa?j + A(1 - a)Pj = 0, (A.7) 

^SjEd\^j\, j = l,...,p, 

where {(3o,/3) is an optimizer. Rewriting the last row by (A.6), we obtain the KKT condi¬ 
tions as a system of equations (2.3). 

The definition of “Newton derivative" is already given in the main text. Now we provide 
several properties useful for calculating Newton derivatives. The first one is the following 
chain rule for Newton derivatives (Ito and Kunisch, 2008). 

Lemma A.l. If F : W —)■ M"* is continuously Frechet differentiable at z eM} with Jacobian 
Jp and G : M”* —)■ M” is Newton differentiable at F{z) with a Newton derivative Hq- Then 
G o F is Newton differentiable at z with a Newton derivative Hc{F{z + h))Jp{z + h) for h 
sufficiently small. 
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We also provide two other results. 


Lemma A.2. In the following, assume F : M”* —)■ W, G : M"* —>• W, z G M"*, F = 
(Fi,..., Fiy and H = {Hj,..., Hj where Hi G i = l,...,l. 

(i) If F is continuously Frechet differentiable at z, then F is also Newton differentiable 
at z and Jp G VnF{z); 

(a) If F is Newton differentiable at z, then for any integer k > 0 and A G AF is 

Newton differentiable at z; if H ^ VnF{z), then AH G VnAF{z); 

(Hi) If F and G are Newton differentiable at z, then F + G is Newton differentiable at z; 
ifHp G VnF{z), Hg G VnG{z), then Hp + Hq e Vn{F + G){z); 

(iv) F is Newton differentiable at z if and only if Fi,... ,Fi are all Newton differentiable 
at z and if G V nF{z) Hi gVNF i{z), i = 1,... ,1; 

Lemma A.3. A univariate piecewise-smooth real function f is everywhere Newton differ¬ 
entiable, with a Newton derivative H given by 

{ /'(z) if f is differentiable at z, 
rz G if f is not differentiable at z. 

B Derivation of SNA for Penalized Huber Loss Regres¬ 
sion 

Following section 2.3.1, denote 5(2;) = (5(2;i),..., and d{(3o, (I) = {h({yi — fo— 

xJ/3 ),..., h((yn — fo — xfl3)Y, then the KKT conditions (2.3) can be written as (2.10). 
Since the soft-thresholding operator is piecewise linear as shown in (2.7), we define 

"4 = {j : l/^i + Sj\ > 1} , 

B = {j ■■ \/3j + SjI < 1}. 

The set A works as an estimate for the support of f. In fact, if (s, fo,/3) satisfies the 
KKT conditions, then the set A defined on (/3, 's) is exactly the support for /3. This is easy 
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to see: since 'Sj G d\^j\, if 7 ^ 0 then \/3j + Sj| = |/3j + sgn(/3j)| = |/3j| + 1 > 1, otherwise 

I I ~ I j I — ^ • 

We decompose (3 into [3a, (3b and s into sa, sb, and denote Z = {s\, [3]^, (3 q, (3j^, . 

Then KKT conditions (2.10) can be rewritten as 


F{Z) 


And from (2.7) we have 


(3a — S{(3a + Sa) 

(3b — S{(3b + Sb) 
-^l^d 

n 

— ^X\d + \asA + A(1 — a)(3A 

— ^X^d + XasB + A(1 — 0')/3b 


0 . 


{ (3a - S{(3a + Sa) = -Sa + sgn(/3^ + s^), 
(3b — <S{(3b + Sb) = (3b- 


(B.2) 


(B.3) 


Let -07 be as in (2.6), and for brevity denote T = '^{f3o,(3) = bdiag('0..y( yi — (3o — 
xj(3),, i>'y{yn — (3o — Xn(3)). Then the following result gives a proper Newton derivative 
of F(Z). 


Theorem B.l. F{Z) is Newton differentiable for any Z G and 



-I\A\ 

0 

0 

0 

0 


0 

I\B\ 

0 

0 

0 

H{Z) := 

0 

^I^Xb 

In^ln 


0 


Aq;/|a| 

XfmXB 


XlTX^ + A(l-a)J|^l 

0 


0 

XlmXB + \{l-a)I\B\ 


XI^Xa 

XaI\B 


e VnF{Z). 


Furthermore; for any 7 > 0 and a G (0,1), on the set {Z = {s,(3o,(3) : there exists i G 
{1, ... ,n} such that \yi — (3o — xJ(3\ < 7 }, H{Z) is invertible and H{Z)~^ is uniformly 
bounded in spectral norm. 

From Theorems 2.1 and B.l, we immediately obtain the following result. 
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Theorem B.2. Given A, 7 , a G (0,1), define Z and F{Z) as (B.2). Suppose Z solves 
F{Z) = 0 and there exists a neighborhood A/'(Z) such that for any Z G N'iZ) there is an 
i G {1,. .. ,n} that satisfies \yi — fio — xj < 7 , then the Newton-type iteration 

^k+i ^ - H{Z'^)-^F{Z^) 

converges superlinearly to Z provided that \\Z^ — Z \\2 is sufficiently small. 

Now we describe the algorithm in details. The {k + l)-th iteration can be split into two 
steps: 

1. Solve Dk from H{Z'^)Dk = -F{Z^fi 

2. Update Z^^^ = Z^ + D^. 

At the hrst glance, step 1 seems to involve inverting a (2p + 1) x (2p+ 1) matrix, which 
is intractable in high dimensional settings. However, the dehnitions of sets A,B in (B.l) 
motivate an “active set strategy" for dimension reduction. Given the estimates from the 
kth iteration, dehne the active set and its complement by ( 2 . 11 ), d^ = d{l3Q,/3^), 
and Dk = (T>aT, corresponding to Zk. 

Now substituting these identities into step 1 and combining (B.3) we have 

oy = -4.+sgn(7. + 4). 

Dk = -A. 


1 

0 

^0 



1 

_1 




_ IXlpt - A(1 - a)/3y - Aasgri(/)y + _ 

D%, = -4 + ■ 

Combining steps 1 and 2, the {k + l)th iteration of SNA is carried out as follows: 
(i) Update and 

4y = sgn(4 + 4), 

4' = 0 . 
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(ii) Find the direction Dq° for the intercept /3o, and for the active coefficients 




+ A(1 - a)/|Afe| 

ilT^, I iT,t,, ok 



(iii) Update the intercept, the active coefficients, and the inactive subgradients: 



C Proofs 

Here we give proofs of Theorems 3.3, 3.4 in the main text and Lemmas A.2, A.3 and 
Theorem B.l in the appendices. The proof of Theorem 3.1 is similar to that of Theorem 
3.3 and hence omitted. 

Proof of Theorem 3.3. 

Proof. Without loss of generality, assume 9k has exactly one cluster point 9*, i.e. 9k —t 9*. 
Notice that 


|t| - I < Kit) < |f|. 


hence 


//i(e;A)-|</fi(9;A,7)</^(9iA). 

Let 9a be a minimizer of /a( 6'; A), and = min/^(6*; A) = /a(6*a; A), then 

9 


fHi9k'K-,'lk) < A,7fc) < /a(6'a;A) — /^. 


For any e > 0, there exists K such that for k > K, 'fk < 2e, then 


fHi9k; A, 7 fc) > fAi9k] A) - e > /^ - e. 


33 








Hence for k > K, 


/S -« < A) - £ < fl 

Let fc —)■ oo, we have < /a( 6 '*) < /a + e. Since e is arbitrary, /a( 6 **) = /a- ^ 


Proof of Theorem 3.4. 


Proof. Without loss of generality, assume 6k has exactly one cluster point 0*, i.e. 9k 6 **. 
Notice that 

which implies 


7 /h(^;A/ 7 , 7 ) < /s(^; A). 


Let ds be a minimizer of fs{6', A), and fg = mmfs{d-, A) = fs{6s', A), then 

9 

7 fc/H( 6 'fc; A/ 7 fc, 7 fc) < jkfHi9s;X/-fk,lk) < fsi6s;X) = fg- 

Since 6k = {/3 q,/3^) is convergent, = y — /3q1 — XI3^ is convergent too. Then there 
exists M > 0 such that Hr^'Hoo < M. There exists K such that for k > K, jk > M, then 
h.y{rik) = ^Tik'^, and 

7fc/H(6'fc; A/7fc,7fc) = fs{6k;X). 


Hence for k > K, 




Let t: -> 00 , we have /s(9*; A) < /J. Since /s(9*; A) > min/s(9; A) = /J, /s(9*) 

9 


- fO 

“ Is- 


□ 


Proof of Lemma A.2. 

Proof. (i) By assumption, the Jacobian Jp is continuous at Since 
\\F{z + h) - F{z) - Jf{z + h)hh 


\\F{z + h)- F{z) - JFiz)h\\2 + WiJpiz) - Jf{z + h))hh 


< 


^ MiHTLS^W7(£iMi + ||jAc)-jAc + A)ii 


^ 0 


as h —)■ 0, by definition Jp G V nF{z). 
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\\AF{z + h)-AF{z)-AH{z + h)h\\2 < \\A\\\\F{z + h)-F{z)-H{z + h)h\\2 = o(||/i||2), 
hence AH G VnAF{z). 

II(F(^ + h) + G{z + h)) - {F{z) + G{z)) - {Hpiz + h) + Hg{z + h))h\\2 
< \\F{z Fh)- F{z) - Hf{z + h)h\\2 + \\G{z + h) - G{z) - Hg{z + h)h\\2 


hence Hp + Hg G Vn^F + G){z). 


(iv) It can be seen by observing that 

i 

\\F{z + h)- F{z) - H{z + h)h\\l = + h)- Fi{z) - Hi{z + h)hf. 

i=l 

□ 


Proof of Lemma A.3. 


Proof. If / is differentiable at 2 ; with derivative /' defined in its neighborhood, by smooth¬ 
ness assumption and Lemma A.2(i), /' G VNf{z). 

If / is not differentiable at z, by assumption there exists s > 0 such that / is smooth 
on both {z — s,z) and {z,z + s) implying that f{z—) = lim/i_^o- f{z+) = 

lim/i_ 5 .o+ A^+h)-fi^) gxist and 

f{z + h) ^ as h —)■ 0“, 

f{z + h) ^ fi^F) as h —)■ O’*'. 


Hence for any £ > 0, there exists a sufficiently small 5 > 0 such that 

Vx€{z-S,z), < e/2, \f'(x)-f'{z-)\<e/2; 

Vx€(z,z + S). LtMzIMzipifzM < £/2, \f'{x)-f(z+)\<e/2. 


Thus for a; G {z — 6, z), 

\f{x) - f{z) - f{x){x -z)\ \f{x) - f{z) - f{z-){x - z)\ 


< 


X — z 


\x — z 


+ l/'(^-) - 
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and similarly for x G (z,z + 6 ). Define ff(z) as in the lemma, then the above implies 


Ve > 0, 3(5 > 0 s.t. V|x — 5\ < z, 


|/(x) - f{z) -H{x){x- z)\ 


< e. 


\x — z 


In other word, / is Newton differentiable at 2 ; with if G VAr/(^)- 


□ 


In order to prove Theorem B.l, we need the following lemma. 

Lemma C.l. Given a G (0,1) and (do, (3 satisfy \yi — (3o — xj(d\ <7 for some i, then H 3 
in (C.l) is invertible with its inverse uniformly bounded in spectral norm, i.e. 


i^3"i < 

Act 


A(1 — a) 


+ 


A^ax(X^X)^ + n 7 A(l-a) 
A(1 — a) 


1 + 


IXI 


A/n 7 A(l — a) 


Proof. Denote J = , then J is diagonal and idempotent. We have 

= —l^Jln = —(Jln)^(Jln), 

n7 n7 


1 + 


2 ||A 1 | 

\/n'yXa 


and 


ll'^XA{Xj'^XA + X{l-a)I\A\) 'xjm,, 

= —(JlnViJXA) {{JXAV{JXA)+n-fX{l - a)fA\)~' (JXAViJln). 
n'y 

Denote a = Jin, Z = JXa, t = n'jX{l — a), and m = |y4|. Then the LHS becomes 

— ( a^a - a~^Z(Z^Z + tI,n)~^Z^a] . 

n7 V J 

Since \yi — (do — xj(d\ < 7 for some i, we have -0* = ^ > 0, implying that Ju = 1 and 
a~^a > Jj^j = 1. Thns we are gnaranteed that a = Jin is not a zero vector. 

Now apply SVD to Z snch that Z = UDV~^, where Unxn and Vmxm are both orthogonal 
matrices, and Dnxm is a rectangnlar diagonal matrix with non-negative diagonal elements 
d \,..., dnn/\n‘ Hence 


Z{Z'^ Z + tlm)~^z'^ = UDV^iVD^U^UDV^+tInf)-^VD^U^ 

= UDV^ {V{D^D + tIn,)V^) VD'^U^ 

= UDV^V{D^ D+ tIm)~^V^VD'^U^ 

= D+ tIm)-^D'^U^. 
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When n > m, 


D{D^D + tIm)~^D~^ 


diag( 


df 




df + t ’ ' dl^ + t 


, 0 ,..., 0 ), 


and when n < m, 

In either case D{D^D + is p.s.d. with Amax(-D(-D^-D + < 1- 

Next we will derive the upper bound of eigenvalues of the above matrix. First, for any 
eigenvalue d and corresponding nonzero eigenvector u of Z~^Z = XaJXa, we have 



- 

|T 1 

- 

du^ u = XJ^JXau = 

u 

1- 

U 


0 


0 


< JX)u^U, 


hence d < A ma xfW'JX). Then again, for any eigenvalue c and corresponding nonzero 
eigenvector v of X^ JX, we have 


cv~^v = v'^X^ JXv = Jiiv'^XixJv < XixJ V = v'^X^Xv < A ma x(X''~X)n''~n, 

i i 

implying that c < A ma xfW'X). 

Therefore, we have d < JX) < Amax(X''^X). Then since the eigenvalues of 

Z"' Z are the diagonal elements of D, the eigenvalues of D[D^D + tIm)~^CX are bounded 

U Ainax(X^X)^ 

A,aax(XTX)2+r 

Then recall t = n 7 A(l — a) and aX^a > 1, we have 


I’^Tl 

■^n ^ 


> 


> 

> 


ll^XA{X^^XA + X{l-a)IiA\) "xjm 


— (a^a - (U^aynm^D + (U^a)) 

ny ^ ^ 

(X^Xf 


1 / T_^ 

nX A„.,(A'tA')2 +« 

1 nyAll — a) 

X 


ri'^ Amax(XTX)2 + n7A(l - a) 

A(1 — a) 


a^a 


Aa,ax(XTX)2 + n7A(l-a) 

0 . 


n 
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Let 





P 

H31 = 

Xj'^l^ Xj'^XA + X{l-a)I\A\ 

, H32 = 



, -^33 — Aa/|B|- 


Observe that iLgg = Then if H^i is invertible, we have 


^3"' = 




0 




Hence to show is invertible, it suffices to show H^i is invertible. Let 

M = + A(1 - «)J|A|, h = XjTl,, 

and 

K = ilm,, - iI^Xa {XJ^Xa + A(1 - a)I\A\)~' xjmr 
Since k > 0, we have 


= 


M-i + 

K. K 


and it follows that ifg is invertible. 

It can be easily shown that ||6|| = ||6'''|| < ;^^||X||, ||M“^|| < Combine this 

with ^ then similar to (C.3), we have 

2 


i^3“ii < 


+ 


X^UX^Xf + niX{l-a) 


and then 




A(1 — a) A(1 — a) 

A^ax(XTX)2 + n7A(l-«) 


1 + 


ll^ll 


A(1 — a) 


+ 


A(1 — a) 


1 + 


vn7A(l — a) 

il^ll 


y/n'^X{l — a) 


1 + 


2i|.y|| 

y/n'jXa 

□ 


Proof of Theorem B.l. 


Proof. Notice S is piecewise-smooth, then by Lemma A.l, A.3 and Lemma A.2 (iv) Fi{Z) 
is Newton differentiable, and with (B.3) 


-/|A| 0 0 0 0 

0 I\B\ 0 0 0 


G XnF^{Z). 
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Similarly, the Huber loss is also piecewise-smooth, and by Lemma A.l, A.3 and Lemma 
A.2 (ii)-(iv), we have F 2 {Z) and F^{Z) are Newton differentiable and 


0 II'^Xb 0 


G VnF2 {Z), 


0 


ll^ln 


0 


XaI\A\ 

XJ^Xb 

Xj^n 

XjTXA + A(l-a)J|A| 

0 

G VnF^Z) 

0 

X^TXb + A (1 - a)I\B\ 


X^TXa 

XaI\B\ 



Again, by Lemma A.2 (iv), F{Z) is Newton differentiable and 


H = 


-I\A\ 

0 
0 

Aq;/|a| 

0 

Now let 
H,= 


0 

I\B\ 

-T, 


0 

0 

-T 


0 

0 




X\^Xb Xjm^^ XT^XA + A(l-a)J|A| 


X^^XB + X{l-a)IiB\ X^mr 


l\A\ 


0 


0 I\B\ 


, H 2 = 


0 




II^Xb 


0 

0 

0 

0 

XaI\B\ 


e VnF{z). 


In^lr 

-T 


Aa/|A| Xa'LX^OX^'LXs + A(1 — Q!)J|b| 

|T, 




(C.l) 


0 

H,= Xjmn Xj^XA + A(l-a)/|A| 0 

XT^ 1 „ X^^Xa XaI\B\ 

Then it is clear that Hi is invertible. Now if is also invertible, which we show in 
Lemma C.l under a mild condition, then via some algebra we have 

0 

X then 


H-^ = 




(C.2) 


Let g = {gj^gjy e 


\H-^g\\l = WHi^giWl + II - H^^H2Hi^gi + H^^g2\\l 

< ll^r'f Ilf7i|l2 + (||i/3“'lll|L^2||||iLr'||||^7i||2 + IIL^: 
<(lli/r'llllf?ill2 + i|iL3“'llll^2iiiiiLr'llllf?ill2 

< (iii/r^ii + \\H^y + i|i73-^iii|i72iii|i/r^ii)^iii7ii 


-1 


2 2J 


22 ; 


(C.3) 
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which implies 




(C.4) 


Notice i|Xyi|| V||Xs|| < il-^ll- Take Xa, without loss of generality shuffle columns of X 
such that X = ( Xa Xb ); then for any g G such that \\g \\2 = 1, we have 


II^A^I|2=||^ 


9 

0 


I 2 < sup{||Xu ||2 : ||u ||2 = 1} = ||X||, 


implying that i|X^i| = sup {HX^S'lb : hh = 1} < ll^ll- Similarly for Xb- 
Then a similar argument as in (C.3) shows that 

||i^2|| < l + a + 2||X||2. 


(C.5) 


Combining (C.4), (C.5) with results of Lemma C.l under its condition, and observing 
that ||Lff ^11 = 1, we obtain the uniform boundedness of H in spectral norm, i.e., 

2 ' 


\H 


-ii 


< 1 + 

X (1 + 


1 

^ + 


\a \ A(1 — a) 

2||X|| 


+ 


X^^{X^Xf + n^X{l-a) 


A(1 — a) 


1 + 


ll^ll 


^/n'yX{l — a) 


y/n'jXa 


(2 + a + 2||X||^). 


□ 


REFERENCES 

Breheny, P. and J. Huang (2011). Coordinate descent algorithms for nonconvex penalized regres¬ 
sion, with applications to biological feature selection. The Annals of Applied Statistics 5(1), 
232-253. 

Biihlmann, P., M. Kalisch, and L. Meier (2014). High-dimensional statistics with a view toward 
applications in biology. Annual Review of Statistics and Rs Application 1, 255-278. 

Chen, X., Z. Nashed, and L. Qi (2000). Smoothing methods and semismooth methods for nondif- 
ferentiable operator equations. SIAM Journal on Numerical Analysis 55(4), 1200-1216. 

Clarke, F. H. (1983). Optimization and Nonsmooth Analysis. Wiley. 

Combettes, P. L. and V. R. Wajs (2005). Signal recovery by proximal forward-backward splitting. 
Multiscale Modeling & Simulation .^(4), 1168-1200. 

Friedman, J., T. Hastie, H. Hofling, and R. Tibshirani (2007). Pathwise coordinate optimization. 
The Annals of Applied Statistics 7(2), 302-332. 

Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear 
models via coordinate descent. Journal of Statistical Software 55(1), 1-22. 


40 








Holland, P. W. and R. E. Welsch (1977). Robust regression using iteratively reweighted least- 
squares. Communications in Statistics-Theory and Methods 6{9), 813-827. 

Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and monte carlo. The Annals 
of Statisties i(5), 799-821. 

Ito, K. and K. Kunisch (2008). Lagrange Multiplier Approaeh to Variational Problems and Appli- 
eations. Philadelphia, PA: SIAM. 

Koenker, R. and G. Bassett Jr (1978). Regression quantiles. Eeonometrica .^^(1), 33-50. 

Koenker, R. and J. A. Machado (1999). Goodness of fit and related inference processes for quantile 
regression. Journal of the Ameriean Statistical Association 9.^ (448), 1296-1310. 

Mifflin, R. (1977). Semismooth and semiconvex functions in constrained optimization. SIAM 
Journal on Control and Optimization i5(6), 959-972. 

Osborne, M. and B. Turlach (2011). A homotopy algorithm for the quantile regression lasso and 
related piecewise linear problems. Journal of Computational and Craphical Statistics 20 {A) ^ 
972-987. 

Peng, B. and L. Wang (2015). An iterative coordinate descent algorithm for high-dimensional non- 
convex penalized quantile regression. Journal of Computational and Craphical Statistics 24(3), 
676-694. 

Portnoy, S., R. Koenker, et al. (1997). The gaussian hare and the laplacian tortoise: Gomputability 
of squared-error versus absolute-error estimators. Statistical Science 12{4), 279-300. 

Qi, L. and J. Sun (1993). A nonsmooth version of newton’s method. Mathematieal Program¬ 
ming 55(1-3), 353-367. 

Rockafellar, R. T. (1970). Convex Analysis. Princeton, NJ: Princeton University Press. 

Simon, N., J. Friedman, T. Hastie, and R. Tibshirani (2011). Regularization paths for cox’s 
proportional hazards model via coordinate descent. Journal of Statistieal Software 39{5), 1-13. 

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal 
Statistical Society. Series B (Methodologieal) 55(1), 267-288. 

Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani (2012). 
Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistieal 
Society: Series B (Statistieal Methodology) 74{2), 245-266. 

Tseng, P. (2001). Gonvergence of a block coordinate descent method for nondifferentiable mini¬ 
mization. Journal of Optimization Theory and Applieations 109{3), 475-494. 

Wang, L., Y. Wu, and R. Li (2012). Quantile regression for analyzing heterogeneity in ultra-high 
dimension. Journal of the Ameriean Statistieal Association 757(497), 214-222. 

Wu, T. T. and K. Lange (2008). Goordinate descent algorithms for lasso penalized regression. The 
Annals of Applied Statisties 2{1), 224-244. 

Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal 
of the Royal Statistieal Society: Series B (Statistical Methodology) 67(2), 301-320. 


41 



